[build] add Github Cache workflow and cancel-on-failure guard in bazel.yml #17575
titusfortner wants to merge 2 commits into
Review Summary by Qodo
Add Github Cache workflow and cancel-on-failure guard for CI
Walkthroughs
Description
- Add Github Cache workflow to pre-generate Bazel repository cache for macOS and Windows
- Enable cache-save in gh-cache workflow with targeted build targets
- Add cancel-on-failure guard to prevent poisoned cache saves in bazel.yml
- Update workflow triggers to run on dependency file changes or daily schedule
- Grant actions write permission for workflow cancellation capability
Diagram
flowchart LR
A["Dependency Changes<br/>or Daily Schedule"] -->|Trigger| B["Github Cache Workflow"]
B -->|Pre-generate Cache| C["macOS & Windows<br/>Repository Cache"]
D["Bazel Test Job"] -->|Use Cache| C
D -->|Failure Detected| E["Cancel Run Guard"]
E -->|Prevent Poisoned Cache| F["Skip Cache Save"]
File Changes
1. .github/workflows/bazel.yml
Code Review by Qodo
1. Cache triggers miss lockfiles
Pull request overview
Adds CI support to proactively populate and safely persist the Bazel repository cache on GitHub Actions, reducing failures caused by transient upstream download issues and avoiding cache “poisoning”.
Changes:
- Introduces a scheduled/paths-triggered `gh-cache` workflow that runs `bazel build --nobuild` on macOS and Windows and saves the repository cache.
- Grants `actions: write` permission to workflows that need to save/cache or cancel runs.
- Adds a cancel-on-Bazel-failure step in the reusable `bazel.yml` workflow to prevent saving a bad cache.
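
The cancel-on-failure guard in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the step from this PR: the workflow name, the `//...` target, the pinned action versions, and the 60-second wait are all assumptions made for illustration. It relies on the setup-bazel post step skipping its cache save when the run has been cancelled.

```yaml
# Hypothetical workflow, for illustration only.
name: bazel-guard-sketch
on: workflow_dispatch

permissions:
  contents: read
  actions: write   # needed so the job may cancel its own run

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bazel-contrib/setup-bazel@0.15.0   # version pin is an assumption
        with:
          repository-cache: true
      - name: Run Bazel
        run: bazel test //...                    # placeholder target
      - name: Cancel run on Bazel failure
        if: failure()
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Request cancellation of the whole run, then wait so the
          # cancellation arrives before the post (cache-save) steps start.
          gh run cancel "${{ github.run_id }}" --repo "${{ github.repository }}"
          sleep 60   # assumed grace period
```

Cancelling rather than simply failing is the point of the design: a failed job would still reach the post step and could overwrite a good cache with a partial one.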
Note: after updating files in this repo, run (or have CI run) ./go format before merging to avoid formatter-related CI failures.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/gh-cache.yml | New/updated workflow to warm and save Bazel repository cache for macOS and Windows on trunk/schedule. |
| .github/workflows/ci-rbe.yml | Grants actions: write permission so cache-save/cancel behavior can work on trunk. |
| .github/workflows/bazel.yml | Removes cache-version and adds a failure-triggered run-cancel guard intended to prevent poisoned cache saves. |
b79faaf to 21a1dae
Code review by qodo was updated up to the latest commit 21a1dae
    push:
      branches: [trunk]
      paths:
        - 'MODULE.bazel'
        - 'rust/Cargo.lock'
        - 'java/maven_install.json'
        - 'py/requirements_lock.txt'
        - 'rb/Gemfile.lock'
        - 'dotnet/paket.lock'
        - 'common/repositories.bzl'
        - 'common/browsers.bzl'
    schedule:
1. Cache triggers miss lockfiles 🐞 Bug ☼ Reliability
The gh-cache workflow’s push path filter omits pnpm-lock.yaml and multitool.lock.json, so updates to these Bazel dependency inputs won’t trigger cache population for macOS/Windows. Because MODULE.bazel uses these files to generate external repositories, CI may still need to download new artifacts (and hit the same transient 502/cert failures) until the next scheduled run.
Agent Prompt
### Issue description
`.github/workflows/gh-cache.yml` uses a `push.paths` filter to decide when to repopulate the GitHub cache, but it does not include key dependency inputs (`pnpm-lock.yaml`, `multitool.lock.json`) that are used by Bazel to generate external repositories. This means macOS/Windows cache population will not run when those files change, leaving caches stale.
### Issue Context
`MODULE.bazel` references both `//:multitool.lock.json` (rules_multitool hub) and `//:pnpm-lock.yaml` (npm_translate_lock). These files directly affect what external assets Bazel will download.
### Fix Focus Areas
- .github/workflows/gh-cache.yml[4-15]
### Proposed fix
Add the missing files to the `on.push.paths` list (at minimum `pnpm-lock.yaml` and `multitool.lock.json`). Consider also including other npm-related inputs referenced by `npm_translate_lock` (e.g. `package.json`, `pnpm-workspace.yaml`, `.npmrc`, and relevant `javascript/**/package.json`) if you want cache population to run immediately when those change.
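
As a sketch of the minimal version of that fix, the trigger block might look like the fragment below. It assumes the existing entries stay unchanged and that both lockfiles live at the repository root, as the `//:pnpm-lock.yaml` and `//:multitool.lock.json` labels suggest; the optional npm-related inputs are left out.

```yaml
on:
  push:
    branches: [trunk]
    paths:
      - 'MODULE.bazel'
      - 'pnpm-lock.yaml'         # added: input to npm_translate_lock
      - 'multitool.lock.json'    # added: input to the rules_multitool hub
      - 'rust/Cargo.lock'
      - 'java/maven_install.json'
      - 'py/requirements_lock.txt'
      - 'rb/Gemfile.lock'
      - 'dotnet/paket.lock'
      - 'common/repositories.bzl'
      - 'common/browsers.bzl'
```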
Background
We've had frequent failures recently from being unable to download assets, for reasons ranging from 502 Bad Gateway errors to certificate issues.
One fix is to improve our cache so the CI RBE workflow can use it instead of always downloading everything.
The cache has been disabled because it is quite large and GitHub limits us to 10GB.
I've already wired up a number of things in trunk to verify that everything works; this PR is the final step and serves as much to document the previous work as to change anything.
The traditional repository cache holds the raw artifact downloads. In Bazel 7, `--repo_contents_cache` was added, which also stores extracted repos (somewhat duplicating what we have in setup-bazel with the external cache).
Extracted repos are 2x the size of the downloads alone and are less useful on CI, where they are used only once.
So I disabled `--repo_contents_cache` on the CI in 5f7df0d (and 9045e3b).
This made the repository cache small enough to enable on RBE (8f3b261).
We're also deleting all the CodeQL caches that keep adding up, since GitHub can manage these for us (312e586 & 6f5004c).
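
For readers unfamiliar with the two caches, the fragment below sketches one way to keep the download cache while turning off the extracted-repo cache in a setup-bazel step. It is an assumption-laden illustration, not what the referenced commits do: it assumes the flag is passed through setup-bazel's `bazelrc` input and that an empty value for `--repo_contents_cache` disables the feature, and the version pin is hypothetical.

```yaml
steps:
  - uses: bazel-contrib/setup-bazel@0.15.0   # version pin is an assumption
    with:
      repository-cache: true   # keep the raw artifact downloads
      bazelrc: |
        # Assumed: an empty value turns the repo contents cache off.
        common --repo_contents_cache=
```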
The other issue we've had is that the Windows and macOS repository caches are generated by whichever language job happens to run first after the last cache was evicted.
This PR uses a `Github Cache` workflow to run `bazel build --nobuild` to generate the Bazel repository cache for macOS and Windows.
So now all jobs will pull the repository cache and be able to use anything inside it, and only 1 job per OS will save the cache.
The RBE job generates the repository cache every time regardless of which tests run, so it might as well save what it has when it is done.
The GitHub cache jobs will run on trunk whenever a file likely to affect a download changes, or once a day.
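
A rough sketch of such a cache-warming workflow is shown below. Everything specific in it is an assumption for illustration: the workflow and job names, the cron time, the runner labels, the `//...` target, and the direct use of setup-bazel with its `cache-save` input rather than the repository's reusable `bazel.yml`.

```yaml
# Hypothetical workflow, for illustration only.
name: gh-cache-sketch
on:
  push:
    branches: [trunk]
    paths:
      - 'MODULE.bazel'        # plus the other dependency lockfiles
  schedule:
    - cron: '0 6 * * *'       # once a day; the actual time is an assumption

permissions:
  contents: read

jobs:
  warm-cache:
    strategy:
      matrix:
        os: [macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: bazel-contrib/setup-bazel@0.15.0   # version pin is an assumption
        with:
          repository-cache: true
          cache-save: true    # this is the one job per OS that saves the cache
      - name: Fetch dependencies without building
        run: bazel build --nobuild //...         # placeholder target
```

Because `--nobuild` stops after loading and analysis, the job downloads what the targets need without compiling anything, which keeps the run short while still filling the repository cache.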
The setup-bazel action saves the cache unless the job has been cancelled, and for these jobs we don't want broken builds to overwrite a good cache,
so I've added a cancel-on-failure guard in `bazel.yml` to prevent poisoned cache saves.
Additional Considerations
Ideally we would turn `repo_contents_cache` back on and disable the `external-cache` in setup-bazel, but right now the total size of the Windows/macOS/Linux repo caches would exceed 10GB.
There are a few ways we could improve that, but this work should address the current primary concern.
🤖 AI assistance